Localizing Policy Gradient Estimates to Action Transitions
Authors
Abstract
Function Approximation (FA) representations of the state-action value function Q have been proposed in order to reduce variance in performance gradient estimates, and thereby improve performance of Policy Gradient (PG) reinforcement learning in large continuous domains (e.g., the PIFA algorithm of Sutton et al. (in press)). We show empirically that although PIFA converges significantly faster than traditional PG algorithms such as REINFORCE, which directly sample Q (without using FA), FA representations of Q are not necessary to reduce variance in performance gradient estimates, and PG algorithms that use selective direct samples of Q can converge orders of magnitude faster than PIFA. We present a new PG algorithm, called Action Transition Policy Gradient (ATPG), which uses direct samples of Q and restricts estimates of the gradient to coincide with action transitions, thus obtaining relative value estimates of executing actions without using FA representations of Q. We prove that ATPG gives an unbiased estimate of the performance gradient and converges to an optimal policy under piecewise continuity conditions on the policy and the state-action value function. Further, in an experimental comparison with PIFA and REINFORCE, ATPG always outperforms both algorithms, taking orders of magnitude fewer iterations to converge on all but very simple problems.
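As a concrete illustration of the idea of localizing gradient estimates to action transitions, the following is a minimal Python sketch, not the published ATPG algorithm: for a linear softmax policy over discrete actions, likelihood-ratio gradient terms are accumulated only at timesteps where the sampled action changes, and the difference between the sampled discounted returns before and after the change stands in as a relative value estimate in place of a function-approximated Q. The episode format, the linear policy parameterization, and this particular relative-value estimate are assumptions made for the example.

import numpy as np

def softmax_policy(theta, s):
    """Linear softmax policy: theta has one weight row per discrete action."""
    prefs = theta @ s                          # action preferences
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def atpg_style_gradient(theta, episode, gamma=0.99):
    """Illustrative gradient estimate restricted to action transitions.

    `episode` is a list of (state, action, reward) tuples from one rollout.
    Gradient terms are accumulated only at timesteps where the action differs
    from the previous one, and the relative value of the new action is taken
    from the difference of directly sampled discounted returns (no FA).
    """
    # Discounted return from each timestep: a direct sample of Q(s_t, a_t).
    returns = np.zeros(len(episode))
    G = 0.0
    for t in reversed(range(len(episode))):
        G = episode[t][2] + gamma * G
        returns[t] = G

    grad = np.zeros_like(theta)
    for t in range(1, len(episode)):
        s, a, _ = episode[t]
        if a == episode[t - 1][1]:
            continue                           # no action transition: no gradient term
        relative_value = returns[t] - returns[t - 1]
        p = softmax_policy(theta, s)
        score = -np.outer(p, s)                # grad of log pi(a|s) for the linear softmax
        score[a] += s
        grad += relative_value * score
    return grad

Restricting the sum to action transitions is the only ingredient taken from the abstract above; everything else is standard likelihood-ratio machinery.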
Similar Articles
Bayesian Policy Gradient and Actor-Critic Algorithms
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Many conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. The policy is improved by adjusting the parameters in the direction of the gradient estimate. Since Monte-Carlo methods tend to have high variance, a large num...
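For reference, the Monte-Carlo (likelihood-ratio) performance gradient estimate referred to here has the standard form below, with N sampled trajectories \tau_i, policy \pi_\theta, step size \alpha, and trajectory return R(\tau_i); the notation is the commonly used one, not necessarily the cited paper's:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i - 1} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, R(\tau_i), \qquad \theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta).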
Efficient Sample Reuse in Policy Gradients with Parameter-Based Exploration
The policy gradient approach is a flexible and powerful reinforcement learning method particularly for problems with continuous actions such as robot control. A common challenge is how to reduce the variance of policy gradient estimates for reliable policy updates. In this letter, we combine the following three ideas and give a highly effective policy gradient method: (1) policy gradients with ...
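A minimal sketch of the parameter-based-exploration idea named in the title, under stated assumptions: exploration noise is placed on the policy parameters rather than on the actions, parameters are drawn from a Gaussian hyper-distribution, and the mean and standard deviation of that distribution are updated by a likelihood-ratio gradient of the sampled returns. The function rollout_return and the hyperparameter values are hypothetical and supplied by the caller.

import numpy as np

def parameter_exploration_update(mu, sigma, rollout_return, n_samples=20, lr=0.05, rng=None):
    """One parameter-based-exploration update: perturb parameters, not actions.

    `rollout_return(theta)` runs one episode with fixed policy parameters
    `theta` and returns its total reward (assumed to be provided by the caller).
    """
    rng = rng or np.random.default_rng(0)
    thetas = rng.normal(mu, sigma, size=(n_samples, mu.size))
    returns = np.array([rollout_return(th) for th in thetas])
    advantages = returns - returns.mean()      # baseline-subtracted returns
    # Likelihood-ratio gradients of the Gaussian hyper-distribution.
    grad_mu = ((thetas - mu) / sigma**2 * advantages[:, None]).mean(axis=0)
    grad_sigma = (((thetas - mu) ** 2 - sigma**2) / sigma**3 * advantages[:, None]).mean(axis=0)
    return mu + lr * grad_mu, sigma + lr * grad_sigma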
Expected Policy Gradients for Reinforcement Learning
We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussi...
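For a discrete action space, the "integrate across actions" idea can be sketched as follows; the linear softmax policy and the externally supplied Q-estimates q_values are illustrative assumptions and do not reproduce the paper's continuous-action (Gaussian) result.

import numpy as np

def expected_policy_gradient(theta, s, q_values):
    """Gradient contribution at state `s`, summed over all actions.

    Each action's score function is weighted by its probability and by an
    estimate q_values[a] of Q(s, a), instead of using only the action that
    happened to be sampled, which removes the variance due to action sampling.
    """
    prefs = theta @ s
    p = np.exp(prefs - prefs.max())
    p /= p.sum()
    grad = np.zeros_like(theta)
    for a in range(theta.shape[0]):
        score = -np.outer(p, s)                # grad of log pi(a|s) for a linear softmax
        score[a] += s
        grad += p[a] * q_values[a] * score
    return grad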
Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms*
Gradient-based policy optimization algorithms suffer from high gradient variance; this is usually the result of using Monte Carlo estimates of the Q-value function in the gradient calculation. By replacing this estimate with a function approximator on state-action space, the gradient variance can be reduced significantly. In this paper we present a method for the training of a Gaussian Process t...
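A minimal sketch of the variance-reduction idea described above, using scikit-learn's GaussianProcessRegressor as a stand-in for the paper's Gaussian Process training procedure; the feature construction and kernel choice are assumptions made for the example.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_gp_q(states, actions, returns):
    """Fit a GP on state-action features to observed Monte Carlo returns.

    `states` and `actions` are 2-D arrays with one row per visited (s, a) pair;
    `returns` holds the sampled discounted return from each pair.
    """
    X = np.hstack([states, actions])
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
    gp.fit(X, returns)
    return gp

def gp_q_estimate(gp, state, action):
    """Posterior mean of the GP, used in place of the Monte Carlo Q sample."""
    x = np.concatenate([state, action])[None, :]
    return float(gp.predict(x)[0])

The gradient calculation then uses gp_q_estimate(gp, s, a) wherever a raw sampled return would otherwise appear.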